{
  "cells": [
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "# 在 VTA 上调优卷积神经网络\n",
        "\n",
        "**Author**: [Lianmin Zheng](https://github.com/merrymercy), [Thierry Moreau](https://homes.cs.washington.edu/~moreau/)\n",
        "\n",
        "针对特定加速器设计的自动调优对于任何给定算子获得最佳性能至关重要。此教程，展示了如何在 VTA 上调优整个卷积神经网络。\n",
        "\n",
        "TVM 中 VTA 的算子实现是用模板（template）形式编写的。模板有许多可调旋钮（tunable knob）（平铺因子（tile factor）、虚拟线程（virtual threads）等）。下面将调优神经网络中的所有卷积算子。调优之后，生成日志文件，其中存储所有调优算子的最佳调度参数。当 TVM 编译器编译这些算子时，它将查询这个日志文件以获得最佳的 knob 参数。\n",
        "\n",
        "## 安装依赖\n",
        "\n",
        "要在 `tvm` 中使用 `autotvm` 包，我们需要安装一些额外的依赖项。（如果你使用 python2，将 \"3\" 改为 \"2\"）：\n",
        "\n",
        "```bash\n",
        "pip3 install --user psutil xgboost tornado mxnet requests \"Pillow<7\" cloudpickle\n",
        "```\n",
        "\n",
        "为了使 TVM 在调优期间运行得更快，建议使用 cython 作为 TVM 的 FFI。在 TVM 的根目录下执行（如果使用 python2，则将 \"3\" 改为 \"2\"）：\n",
        "\n",
        "```bash\n",
        "pip3 install --user cython\n",
        "sudo make cython3\n",
        "```\n",
        "\n",
        "现在返回 python 代码。导入包。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 1,
      "metadata": {
        "tags": [
          "remove-cell"
        ]
      },
      "outputs": [],
      "source": [
        "import set_env"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": 2,
      "metadata": {
        "collapsed": false
      },
      "outputs": [
        {
          "ename": "ModuleNotFoundError",
          "evalue": "No module named 'mxnet'",
          "output_type": "error",
          "traceback": [
            "\u001b[0;31m---------------------------------------------------------------------------\u001b[0m",
            "\u001b[0;31mModuleNotFoundError\u001b[0m                       Traceback (most recent call last)",
            "Cell \u001b[0;32mIn[2], line 2\u001b[0m\n\u001b[1;32m      1\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mos\u001b[39;00m\n\u001b[0;32m----> 2\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mmxnet\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mgluon\u001b[39;00m\u001b[38;5;21;01m.\u001b[39;00m\u001b[38;5;21;01mmodel_zoo\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m vision\n\u001b[1;32m      3\u001b[0m \u001b[38;5;28;01mimport\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mnumpy\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mas\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mnp\u001b[39;00m\n\u001b[1;32m      4\u001b[0m \u001b[38;5;28;01mfrom\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;21;01mPIL\u001b[39;00m\u001b[38;5;250m \u001b[39m\u001b[38;5;28;01mimport\u001b[39;00m Image\n",
            "\u001b[0;31mModuleNotFoundError\u001b[0m: No module named 'mxnet'"
          ]
        }
      ],
      "source": [
        "import os\n",
        "from mxnet.gluon.model_zoo import vision\n",
        "import numpy as np\n",
        "from PIL import Image\n",
        "\n",
        "from tvm import topi\n",
        "import tvm\n",
        "from tvm import te\n",
        "from tvm import rpc, autotvm, relay\n",
        "from tvm.contrib import graph_executor, utils, download\n",
        "from tvm.autotvm.measure.measure_methods import request_remote\n",
        "from tvm.autotvm.tuner import XGBTuner, GATuner, RandomTuner, GridSearchTuner\n",
        "\n",
        "import vta\n",
        "from vta.testing import simulator\n",
        "from vta.top import graph_pack"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 编译神经网络\n",
        "\n",
        "从 Gluon 模型使用 Relay 执行特定于 VTA 的编译："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def compile_network(env, target, model, start_pack, stop_pack):\n",
        "    # Populate the shape and data type dictionary\n",
        "    dtype_dict = {\"data\": \"float32\"}\n",
        "    shape_dict = {\"data\": (env.BATCH, 3, 224, 224)}\n",
        "\n",
        "    # Get off the shelf gluon model, and convert to relay\n",
        "    gluon_model = vision.get_model(model, pretrained=True)\n",
        "    mod, params = relay.frontend.from_mxnet(gluon_model, shape_dict)\n",
        "\n",
        "    # Update shape and type dictionary\n",
        "    shape_dict.update({k: v.shape for k, v in params.items()})\n",
        "    dtype_dict.update({k: str(v.dtype) for k, v in params.items()})\n",
        "\n",
        "    # Perform quantization in Relay\n",
        "    # Note: We set opt_level to 3 in order to fold batch norm\n",
        "    with tvm.transform.PassContext(opt_level=3):\n",
        "        with relay.quantize.qconfig(global_scale=8.0, skip_conv_layers=[0]):\n",
        "            mod = relay.quantize.quantize(mod, params=params)\n",
        "\n",
        "    # Perform graph packing and constant folding for VTA target\n",
        "    if target.device_name == \"vta\":\n",
        "        assert env.BLOCK_IN == env.BLOCK_OUT\n",
        "        relay_prog = graph_pack(\n",
        "            mod[\"main\"],\n",
        "            env.BATCH,\n",
        "            env.BLOCK_OUT,\n",
        "            env.WGT_WIDTH,\n",
        "            start_name=start_pack,\n",
        "            stop_name=stop_pack,\n",
        "        )\n",
        "\n",
        "    return relay_prog, params"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 启动 RPC Tracker\n",
        "\n",
        "TVM 使用 RPC 会话与 Pynq 板进行通信。在调优期间，调优器将把生成的代码发送到板上，并度量板上代码的速度。\n",
        "\n",
        "为了扩展调优，TVM 使用 RPC Tracker 来管理多个设备。RPC Tracker 是中心化控制节点。可以在 Tracker 上注册所有设备。例如，如果有 10 块 Pynq 板，可以将它们全部注册到 Tracker 中，并并行运行 10 个度量，从而加速优化过程。\n",
        "\n",
        "要启动 RPC tracker 程序，请在主机上运行此命令。在整个调优过程中都需要此 tracker，所以需要为这个命令打开新的终端：\n",
        "\n",
        "```bash\n",
        "python -m tvm.exec.rpc_tracker --host=0.0.0.0 --port=9190\n",
        "```\n",
        "\n",
        "预期输出为：\n",
        "\n",
        "```bash\n",
        "INFO:RPCTracker:bind to 0.0.0.0:9190\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 注册设备到 RPC Tracker\n",
        "\n",
        "现在可以在 Tracker 上注册设备了。第一步是为 Pynq 设备构建 TVM 运行时。\n",
        "\n",
        "遵循 [VTA：通用张量加速器](https://tvm.apache.org/docs/topic/vta/index.html#vta-index) 在设备上构建 TVM 运行时。然后将设备注册到 Tracker：\n",
        "\n",
        "```bash\n",
        "python -m tvm.exec.rpc_server --tracker=[HOST_IP]:9190 --key=pynq\n",
        "```\n",
        "\n",
        "（将 `[HOST_IP]` 替换为主机的 IP 地址）\n",
        "\n",
        "注册设备后，可以通过查询 rpc_tracker 来确认：\n",
        "\n",
        "```bash\n",
        "python -m tvm.exec.query_rpc_tracker --host=0.0.0.0 --port=9190\n",
        "```\n",
        "\n",
        "例如，如果我们有 6 块 Pynq 板和 11 块树莓派 3B，输出可以是\n",
        "\n",
        "```bash\n",
        "Queue Status\n",
        "----------------------------------\n",
        "key          total  free  pending\n",
        "----------------------------------\n",
        "pynq         6      6     0\n",
        "rpi3b        11     11    0\n",
        "----------------------------------\n",
        "```\n",
        "\n",
        "您可以向跟踪器注册多个设备以加速优化。"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 设置调优选项\n",
        "\n",
        "在调优之前，应该应用一些配置。这里以 Pynq-Z1 板为例。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# Tracker host and port can be set by your environment\n",
        "tracker_host = os.environ.get(\"TVM_TRACKER_HOST\", \"127.0.0.1\")\n",
        "tracker_port = int(os.environ.get(\"TVM_TRACKER_PORT\", 9190))\n",
        "\n",
        "# Load VTA parameters from the 3rdparty/vta-hw/config/vta_config.json file\n",
        "env = vta.get_env()\n",
        "\n",
        "# This target is used for cross compilation. You can query it by :code:`gcc -v` on your device.\n",
        "# Set ``device=arm_cpu`` to run inference on the CPU\n",
        "# or ``device=vta`` to run inference on the FPGA.\n",
        "device = \"vta\"\n",
        "target = env.target if device == \"vta\" else env.target_vta_cpu\n",
        "\n",
        "# Name of Gluon model to compile\n",
        "# The ``start_pack`` and ``stop_pack`` labels indicate where\n",
        "# to start and end the graph packing relay pass: in other words\n",
        "# where to start and finish offloading to VTA.\n",
        "network = \"resnet18_v1\"\n",
        "start_pack = \"nn.max_pool2d\"\n",
        "stop_pack = \"nn.global_avg_pool2d\"\n",
        "\n",
        "# Tuning option\n",
        "log_file = \"%s.%s.log\" % (device, network)\n",
        "tuning_option = {\n",
        "    \"log_filename\": log_file,\n",
        "    \"tuner\": \"random\",\n",
        "    \"n_trial\": 1000,\n",
        "    \"early_stopping\": None,\n",
        "    \"measure_option\": autotvm.measure_option(\n",
        "        builder=autotvm.LocalBuilder(),\n",
        "        runner=autotvm.RPCRunner(\n",
        "            env.TARGET,\n",
        "            host=tracker_host,\n",
        "            port=tracker_port,\n",
        "            number=5,\n",
        "            timeout=60,\n",
        "            module_loader=vta.module_loader(),\n",
        "            # check_correctness=True, # TODO: re-enable when check_correctness works again.\n",
        "        ),\n",
        "    ),\n",
        "}"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "```{admonition} 如何设置调优参数\n",
        ":class: alert alert-info\n",
        "\n",
        "通常，这里提供的默认值工作良好。如果有足够的时间预算，可以将 `n_trial`，`early_stopping` 设置为更大的值，使调优运行时间更长。如果您的设备动力不足或 conv2d 算子过大，请考虑设置较长的超时时间。\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 开始调优\n",
        "\n",
        "现在可以从网络中提取调优任务并开始调优。这里，提供了简单的实用函数来调优任务列表。这个函数只是按顺序对它们进行调优的初始实现。我们将在将来引入更复杂的调优调度器。\n",
        "\n",
        "假设将在 Pynq FPGA 板上进行调优，请确保 ``vta_config.json`` 文件中的 ``TARGET`` 条目设置为 ``pynq``。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "# You can skip the implementation of this function for this tutorial.\n",
        "def tune_tasks(\n",
        "    tasks,\n",
        "    measure_option,\n",
        "    tuner=\"xgb\",\n",
        "    n_trial=1000,\n",
        "    early_stopping=None,\n",
        "    log_filename=\"tuning.log\",\n",
        "    use_transfer_learning=True,\n",
        "):\n",
        "\n",
        "    # create tmp log file\n",
        "    tmp_log_file = log_filename + \".tmp\"\n",
        "    if os.path.exists(tmp_log_file):\n",
        "        os.remove(tmp_log_file)\n",
        "\n",
        "    for i, tsk in enumerate(reversed(tasks)):\n",
        "        prefix = \"[Task %2d/%2d] \" % (i + 1, len(tasks))\n",
        "\n",
        "        # create tuner\n",
        "        if tuner == \"xgb\" or tuner == \"xgb-rank\":\n",
        "            tuner_obj = XGBTuner(tsk, loss_type=\"rank\")\n",
        "        elif tuner == \"xgb_knob\":\n",
        "            tuner_obj = XGBTuner(tsk, loss_type=\"rank\", feature_type=\"knob\")\n",
        "        elif tuner == \"ga\":\n",
        "            tuner_obj = GATuner(tsk, pop_size=50)\n",
        "        elif tuner == \"random\":\n",
        "            tuner_obj = RandomTuner(tsk)\n",
        "        elif tuner == \"gridsearch\":\n",
        "            tuner_obj = GridSearchTuner(tsk)\n",
        "        else:\n",
        "            raise ValueError(\"Invalid tuner: \" + tuner)\n",
        "\n",
        "        if use_transfer_learning:\n",
        "            if os.path.isfile(tmp_log_file):\n",
        "                tuner_obj.load_history(autotvm.record.load_from_file(tmp_log_file))\n",
        "\n",
        "        # do tuning\n",
        "        tsk_trial = min(n_trial, len(tsk.config_space))\n",
        "        tuner_obj.tune(\n",
        "            n_trial=tsk_trial,\n",
        "            early_stopping=early_stopping,\n",
        "            measure_option=measure_option,\n",
        "            callbacks=[\n",
        "                autotvm.callback.progress_bar(tsk_trial, prefix=prefix),\n",
        "                autotvm.callback.log_to_file(tmp_log_file),\n",
        "            ],\n",
        "        )\n",
        "\n",
        "    # pick best records to a cache file\n",
        "    autotvm.record.pick_best(tmp_log_file, log_filename)\n",
        "    os.remove(tmp_log_file)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "注册特定于 VTA 的调优任务："
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [],
      "source": [
        "def register_vta_tuning_tasks():\n",
        "    from tvm.autotvm.task import TaskExtractEnv\n",
        "\n",
        "    @tvm.te.tag_scope(tag=topi.tag.ELEMWISE)\n",
        "    def my_clip(x, a_min, a_max):\n",
        "        \"\"\"Unlike topi's current clip, put min and max into two stages.\"\"\"\n",
        "        const_min = tvm.tir.const(a_min, x.dtype)\n",
        "        const_max = tvm.tir.const(a_max, x.dtype)\n",
        "        x = te.compute(x.shape, lambda *i: tvm.te.min(x(*i), const_max), name=\"clipA\")\n",
        "        x = te.compute(x.shape, lambda *i: tvm.te.max(x(*i), const_min), name=\"clipB\")\n",
        "        return x\n",
        "\n",
        "    # init autotvm env to register VTA operator\n",
        "    TaskExtractEnv()\n",
        "\n",
        "    @autotvm.template(\"conv2d_packed.vta\")\n",
        "    def _topi_nn_conv2d(*args, **kwargs):\n",
        "        assert not kwargs, \"Do not support kwargs in template function call\"\n",
        "        A, W = args[:2]\n",
        "\n",
        "        with tvm.target.vta():\n",
        "            res = vta.top.conv2d_packed(*args, **kwargs)\n",
        "            res = topi.right_shift(res, 8)\n",
        "            res = my_clip(res, 0, 127)\n",
        "            res = topi.cast(res, \"int8\")\n",
        "\n",
        "        if tvm.target.Target.current().device_name == \"vta\":\n",
        "            s = vta.top.schedule_conv2d_packed([res])\n",
        "        else:\n",
        "            s = te.create_schedule([res.op])\n",
        "        return s, [A, W, res]"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "最后，启动调优作业并评估端到端性能。"
      ]
    },
    {
      "cell_type": "code",
      "execution_count": null,
      "metadata": {
        "collapsed": false
      },
      "outputs": [
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Extract tasks...\n"
          ]
        },
        {
          "name": "stderr",
          "output_type": "stream",
          "text": [
            "/media/pc/data/lxw/ai/tvm/xinetzone/__pypackages__/3.10/lib/tvm/target/target.py:280: UserWarning: target_host parameter is going to be deprecated. Please pass in tvm.target.Target(target, host=target_host) instead.\n",
            "  warnings.warn(\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:24] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n",
            "[21:29:25] /media/pc/data/lxw/ai/tvm/src/tir/transforms/arg_binder.cc:95: Warning: Trying to bind buffer to another one with lower alignment requirement  required_alignment=256, provided_alignment=64\n"
          ]
        },
        {
          "name": "stdout",
          "output_type": "stream",
          "text": [
            "Extracted 10 conv2d tasks:\n",
            "(1, 56, 56, 64, 64, 3, 3, 1, 1, 1, 1)\n",
            "(1, 56, 56, 64, 128, 1, 1, 0, 0, 2, 2)\n",
            "(1, 56, 56, 64, 128, 3, 3, 1, 1, 2, 2)\n",
            "(1, 28, 28, 128, 128, 3, 3, 1, 1, 1, 1)\n",
            "(1, 28, 28, 128, 256, 1, 1, 0, 0, 2, 2)\n",
            "(1, 28, 28, 128, 256, 3, 3, 1, 1, 2, 2)\n",
            "(1, 14, 14, 256, 256, 3, 3, 1, 1, 1, 1)\n",
            "(1, 14, 14, 256, 512, 1, 1, 0, 0, 2, 2)\n",
            "(1, 14, 14, 256, 512, 3, 3, 1, 1, 2, 2)\n",
            "(1, 7, 7, 512, 512, 3, 3, 1, 1, 1, 1)\n"
          ]
        }
      ],
      "source": [
        "def tune_and_evaluate(tuning_opt):\n",
        "\n",
        "    # Register VTA tuning tasks\n",
        "    register_vta_tuning_tasks()\n",
        "\n",
        "    # Perform task extraction on Relay program\n",
        "    print(\"Extract tasks...\")\n",
        "    relay_prog, params = compile_network(env, target, network, start_pack, stop_pack)\n",
        "    mod = tvm.IRModule.from_expr(relay_prog)\n",
        "    tasks = autotvm.task.extract_from_program(\n",
        "        mod,\n",
        "        params=params,\n",
        "        ops=(relay.op.get(\"nn.conv2d\"),),\n",
        "        target=target,\n",
        "        target_host=env.target_host,\n",
        "    )\n",
        "\n",
        "    # filter out non-packed conv2d task\n",
        "    tasks = list(filter(lambda t: len(t.args[0][1]) > 4 and \"conv\" in t.name, tasks))\n",
        "\n",
        "    # We should have extracted 10 convolution tasks\n",
        "    assert len(tasks) == 10\n",
        "    print(\"Extracted {} conv2d tasks:\".format(len(tasks)))\n",
        "    for tsk in tasks:\n",
        "        inp = tsk.args[0][1]\n",
        "        wgt = tsk.args[1][1]\n",
        "        batch = inp[0] * inp[4]\n",
        "        in_filter = inp[1] * inp[5]\n",
        "        out_filter = wgt[0] * wgt[4]\n",
        "        height, width = inp[2], inp[3]\n",
        "        hkernel, wkernel = wgt[2], wgt[3]\n",
        "        hstride, wstride = tsk.args[2][0], tsk.args[2][1]\n",
        "        hpad, wpad = tsk.args[3][0], tsk.args[3][1]\n",
        "        print(\n",
        "            \"({}, {}, {}, {}, {}, {}, {}, {}, {}, {}, {})\".format(\n",
        "                batch,\n",
        "                height,\n",
        "                width,\n",
        "                in_filter,\n",
        "                out_filter,\n",
        "                hkernel,\n",
        "                wkernel,\n",
        "                hpad,\n",
        "                wpad,\n",
        "                hstride,\n",
        "                wstride,\n",
        "            )\n",
        "        )\n",
        "\n",
        "    # We do not run the tuning in our webpage server since it takes too long.\n",
        "    # Comment the following line to run it by yourself.\n",
        "    return\n",
        "\n",
        "    # run tuning tasks\n",
        "    print(\"Tuning...\")\n",
        "    tune_tasks(tasks, **tuning_opt)\n",
        "\n",
        "    # evaluate with tuning history\n",
        "    if env.TARGET != \"sim\":\n",
        "        # Get remote from fleet node\n",
        "        remote = autotvm.measure.request_remote(\n",
        "            env.TARGET, tracker_host, tracker_port, timeout=10000\n",
        "        )\n",
        "        # Reconfigure the JIT runtime and FPGA.\n",
        "        vta.reconfig_runtime(remote)\n",
        "        vta.program_fpga(remote, bitstream=None)\n",
        "    else:\n",
        "        # In simulation mode, host the RPC server locally.\n",
        "        remote = rpc.LocalSession()\n",
        "\n",
        "    # compile kernels with history best records\n",
        "    with autotvm.tophub.context(target, extra_files=[log_file]):\n",
        "        # Compile network\n",
        "        print(\"Compile...\")\n",
        "        if target.device_name != \"vta\":\n",
        "            with tvm.transform.PassContext(opt_level=3, disabled_pass={\"AlterOpLayout\"}):\n",
        "                lib = relay.build(\n",
        "                    relay_prog, target=target, params=params, target_host=env.target_host\n",
        "                )\n",
        "        else:\n",
        "            with vta.build_config(opt_level=3, disabled_pass={\"AlterOpLayout\"}):\n",
        "                lib = relay.build(\n",
        "                    relay_prog, target=target, params=params, target_host=env.target_host\n",
        "                )\n",
        "\n",
        "        # Export library\n",
        "        print(\"Upload...\")\n",
        "        temp = utils.tempdir()\n",
        "        lib.export_library(temp.relpath(\"graphlib.tar\"))\n",
        "        remote.upload(temp.relpath(\"graphlib.tar\"))\n",
        "        lib = remote.load_module(\"graphlib.tar\")\n",
        "\n",
        "        # Generate the graph executor\n",
        "        ctx = remote.ext_dev(0) if device == \"vta\" else remote.cpu(0)\n",
        "        m = graph_executor.GraphModule(lib[\"default\"](ctx))\n",
        "\n",
        "        # upload parameters to device\n",
        "        image = tvm.nd.array((np.random.uniform(size=(1, 3, 224, 224))).astype(\"float32\"))\n",
        "        m.set_input(\"data\", image)\n",
        "\n",
        "        # evaluate\n",
        "        print(\"Evaluate inference time cost...\")\n",
        "        timer = m.module.time_evaluator(\"run\", ctx, number=1, repeat=10)\n",
        "        tcost = timer()\n",
        "        prof_res = np.array(tcost.results) * 1000  # convert to millisecond\n",
        "        print(\n",
        "            \"Mean inference time (std dev): %.2f ms (%.2f ms)\"\n",
        "            % (np.mean(prof_res), np.std(prof_res))\n",
        "        )\n",
        "\n",
        "\n",
        "# Run the tuning and evaluate the results\n",
        "tune_and_evaluate(tuning_option)"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "## 样例输出\n",
        "\n",
        "调优需要编译许多程序并从中提取特性。建议配置高性能的 CPU。下面列出了示例输出。16T CPU 选配 6 块 Pynq 单板，大约需要 2 小时。\n",
        "\n",
        "```{eval-rst}\n",
        ".. code-block:: bash\n",
        "\n",
        "   Extract tasks...\n",
        "   [Warning] Invalid shape during AutoTVM task creation\n",
        "   Extracted 10 conv2d tasks:\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (32, 16, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (16, 8, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (8, 4, 1, 1, 16, 16), 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (8, 4, 1, 1, 16, 16, 'int8'), (2, 2), (0, 0), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (4, 4, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (4, 4, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (8, 8, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (8, 8, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 4, 56, 56, 1, 16), 'int8'), ('TENSOR', (8, 4, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 4, 56, 56, 1, 16, 'int8'), (8, 4, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (16, 16, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (16, 16, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 8, 28, 28, 1, 16), 'int8'), ('TENSOR', (16, 8, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 8, 28, 28, 1, 16, 'int8'), (16, 8, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 32, 7, 7, 1, 16), 'int8'), ('TENSOR', (32, 32, 3, 3, 16, 16), 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 32, 7, 7, 1, 16, 'int8'), (32, 32, 3, 3, 16, 16, 'int8'), (1, 1), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "       Task(func_name=topi_nn_conv2d, args=(('TENSOR', (1, 16, 14, 14, 1, 16), 'int8'), ('TENSOR', (32, 16, 3, 3, 16, 16), 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'), kwargs={}, workload=('conv2d', (1, 16, 14, 14, 1, 16, 'int8'), (32, 16, 3, 3, 16, 16, 'int8'), (2, 2), (1, 1), (1, 1), 'NCHW1n16c', 'int32'))\n",
        "   Tuning...\n",
        "   [Task  1/10]  Current/Best:    0.72/  23.24 GFLOPS | Progress: (480/1000) | 640.31 s Done.\n",
        "   [Task  2/10]  Current/Best:    0.00/  27.69 GFLOPS | Progress: (576/1000) | 810.09 s Done.\n",
        "   [Task  3/10]  Current/Best:    0.00/  22.97 GFLOPS | Progress: (1000/1000) | 1125.37 s Done.\n",
        "   [Task  4/10]  Current/Best:    0.00/  31.26 GFLOPS | Progress: (1000/1000) | 1025.52 s Done.\n",
        "   [Task  5/10]  Current/Best:    0.00/  15.15 GFLOPS | Progress: (1000/1000) | 1236.58 s Done.\n",
        "   [Task  6/10]  Current/Best:    0.00/  22.74 GFLOPS | Progress: (1000/1000) | 906.60 s Done.\n",
        "   [Task  7/10]  Current/Best:    0.00/  15.27 GFLOPS | Progress: (1000/1000) | 1056.25 s Done.\n",
        "   [Task  8/10]  Current/Best:    0.00/   2.18 GFLOPS | Progress: (1000/1000) | 2275.29 s Done.\n",
        "   [Task  9/10]  Current/Best:    2.23/   3.99 GFLOPS | Progress: (1000/1000) | 2527.25 s Done.\n",
        "   [Task 10/10]  Current/Best:    1.56/   6.32 GFLOPS | Progress: (480/1000) | 1304.84 s Done.\n",
        "   Compile...\n",
        "   Upload...\n",
        "   Evaluate inference time cost...\n",
        "   Mean inference time (std dev): 621.79 ms (0.14 ms)\n",
        "```"
      ]
    },
    {
      "cell_type": "markdown",
      "metadata": {},
      "source": [
        "````{admonition} 实验困难？\n",
        ":class: alert alert-info\n",
        "\n",
        "自动调优模块容易出错。如果你总是看到 \" 0.00/ 0.00 GFLOPS\"，那么一定是哪里出了问题。\n",
        "\n",
        "首先，确保您设置了正确的设备配置。然后，您可以通过在脚本的开头添加这些行来打印调试信息。它将打印每个测量结果，您可以在其中找到有用的错误消息。\n",
        "\n",
        "```python\n",
        "import logging\n",
        "logging.getLogger('autotvm').setLevel(logging.DEBUG)\n",
        "```\n",
        "\n",
        "最后，请随时在 <https://discuss.tvm.apache.org> 社区寻求帮助。\n",
        "````"
      ]
    }
  ],
  "metadata": {
    "kernelspec": {
      "display_name": "Python 3",
      "language": "python",
      "name": "python3"
    },
    "language_info": {
      "codemirror_mode": {
        "name": "ipython",
        "version": 3
      },
      "file_extension": ".py",
      "mimetype": "text/x-python",
      "name": "python",
      "nbconvert_exporter": "python",
      "pygments_lexer": "ipython3",
      "version": "3.12.2"
    }
  },
  "nbformat": 4,
  "nbformat_minor": 0
}
